TO KEEP IN MIND - la fiecare grafic: ce vrem sa vedem (scris la inceput), tipul de grafic, datele folosite, interpretare

We can also add population for each state si sa vedem % incidents raportat la populatie

Another Data frame for state codes to use for graphs instead of full names

Daca avem timp la final, it might be nice sa facem si o predictie. Daca nu facem, sa stergem asta din descrierea initiala

We could also do stuff like PCA maybe,care ar fi mai usoare decat predictia

Niste categorii cu ce facem scrise la inceput

!!! Interactive plots. Zoom, tooltip, choosing a variable to display more (https://r-graph-gallery.com/interactive-charts.html)

Introduction

The objective of this project is to perform an anlysis of Gun Violence in US in the past decade. The statistics have as their main purpose discovering trends and correlations between several factors and gun violence and to predict based on the data we have the number of incidents in the future.

We have chosen to employ both Python and R as programming languages, the former being used for data processing and modelling purposes, and the latter for the actual visualization of the data gathered.

Three data sets have been used, the primary one which includes the topic specific information, and two supporting sets that contain the population for each state according to the 2019 census and the state codes.

The data source for the project is Kaggle.

Settings

R specific libraries import

library(hrbrthemes)
## Registered S3 methods overwritten by 'tibble':
##   method     from  
##   format.tbl pillar
##   print.tbl  pillar
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(reticulate)
library(ggplot2)
library(hrbrthemes)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Python specific libraries import

import pandas as pd
import numpy as np
import datetime as dt

Data extraction from the CSV files

fileName = 'gun-violence.csv'
fullsetDF = pd.read_csv(fileName)
fileName = 'us_population_by_state.csv'
populationDF = pd.read_csv(fileName)
fileName = 'state_codes.csv'
stateCodesDF = pd.read_csv(fileName)

Data Cleaning

Check if there are any missing values in any other field. If there are, the value will be replaced with “N/A” - do we need to do this tho?

Remove unnecesary data

Fields “incident_id” and “incident_url_fields_missing”, are not need for our analysis so they were removed.

fullsetDF = fullsetDF.drop("incident_id", axis = 1)
fullsetDF = fullsetDF.drop("incident_url_fields_missing", axis = 1)

Processing

We have split the date into months and years sine this information will be used later on when studying the changes over the years and if there are any “seasonal changes”.

def splitDates(datesList):
    months = []
    years = []
    for i in range(0, len(datesList)):
        value = datesList[i]
        splitDate = dt.datetime.strptime(value, "%Y-%m-%d")
        monthName = splitDate.strftime('%b')
        months.append(monthName)
        years.append(splitDate.year)
    fullsetDF.insert(1, "MONTH", months)
    fullsetDF.insert(2, "YEAR", years)

allDates = fullsetDF['date']
splitDates(allDates)

Congressional district? Is this needed? Maybe only if we add a little bit more information

We would like to see if the type of venue has any impact on the number of incidents that occur. However, there are no clearly defined types and as such, we had to define them ourselves and see which fall into the specific category.The location can be mentioned in several fields, so we will use all of them for our search: location_description, incident_characteristics and notes.

A new field was added called “Location Type”

After this classification, the field “location_description” is no longer useful, so it was deleted. - MEH

locationType = ["Bar/Club", "College", "School", "Apartments", "Park", "Restaurant"]
locations = fullsetDF["location_description"]
characteristics = fullsetDF["incident_characteristics"]
notes = fullsetDF["notes"]

cand luam location we need to ignore ce e in (). I need to take this into account - adauga asta in descriere

need to figure out cum sa facem aici cu bar/club ca in characteristics e pus impreuna, in location e doar unul dintre ele - adauga asta in descriere

for i in range (0, len(locations)):

value = characteristics[i]

if type(value) == float:

for j in range (0, len(locationType)):

if locationType[j].tolower() in value.tolower():

fullsetDF = fullsetDF.drop(“location_description”, axis = 1)

The field incident_characteristics that can be used to find types of incidents. As described previously, we have defined certain categories and searched the text to find in which categories incident fall in.

Unlike the previous example where only one venue was possible, here there can be more than one or more characteristics describing each incident, which required us to define different fields with a YES/NO response (binary categorical variables).

incidentType = ["Home Invasion", "Mass Shooting", "Officer Involved", "Armed Robbery", "Drive-by", "Domestic Violence","Gang"]
fieldNo = fullsetDF.columns.get_loc("incident_characteristics")
matrix = []

for i in range(0, len(incidentType)):
    incidentTypeDiscovered = []
    for j in range(0, len(characteristics)):
        value = characteristics[j]
        if type(value) != float:
          if incidentType[i].lower() in value.lower():
            incidentTypeDiscovered.append("YES")
          else:
            incidentTypeDiscovered.append("NO")
        else:
          incidentTypeDiscovered.append("UNKNOWN")
    fieldName = incidentType[i]
    fullsetDF.insert(fieldNo, fieldName, incidentTypeDiscovered)

The number of guns is not clearly stated. However, we can calculate it based on the number of values in the field “gun_stolen” (if there is no information, the number shown will be 0; it does not signify the absence of a gun but rather lack of sufficient information). A new field is created “No of Guns” which will be positioned right before the “gun_stolen” field.

allInfoGuns = fullsetDF["gun_stolen"]
noOfGuns = []
for i in range (0, len(allInfoGuns)):
    arrValuesEntry = []
    value = allInfoGuns[i]
    if type(value) == float:
        # We should check here daca chiar e ok sa punem 0 sau ar trebui N/A or something like that. Am putea sa punem 1 because we assume there was at least 1 gun involved, it seems reasonable to make this assumption
        noOfGuns.append(0)
    else:
        arrValuesEntry = value.split('||')
        noOfGuns.append(len(arrValuesEntry))
fieldNo = fullsetDF.columns.get_loc("gun_stolen")
fullsetDF.insert(fieldNo, "NO OF GUNS", noOfGuns)

Field “participant_gender” offers information for all the people involved. We would like to get the actual number and add these fields to the dataframe.

genderParticipants = fullsetDF["participant_gender"]
female = []
male = []
for i in range (0, len(genderParticipants)):
    value = genderParticipants[i]
    if type(value) != float:
        no = value.count("Female")
        female.append(no)
        no = value.count("Male")
        male.append(no)
    else:
        female.append(0)
        male.append(0)

fieldNo = fullsetDF.columns.get_loc("participant_gender")
fullsetDF.insert(fieldNo, "FEMALES", female)
fullsetDF.insert(fieldNo, "MALES", male)

We would also like to make the distinction between the number of adults, teenagers and children involved in the incidents and we will use the same process as above on the field “participant_age_group”.

ageParticipants = fullsetDF["participant_age_group"]
children = []
teenagers = []
adults = []
for i in range (0, len(ageParticipants)):
    value = ageParticipants[i]
    if type(value) != float:
        no = value.count("Adult 18+")
        adults.append(no)
        no = value.count("Teen 12-17")
        teenagers.append(no)
        no = value.count("Child 0-11")
        children.append(no)
    else:
        adults.append(0)
        teenagers.append(0)
        children.append(0)
fieldNo = fullsetDF.columns.get_loc("participant_age_group")
fullsetDF.insert(fieldNo, "ADULTS", adults)
fullsetDF.insert(fieldNo, "TEENAGERS", teenagers)
fullsetDF.insert(fieldNo, "CHILDREN", children)

latitude and logitude - is this really helpful?

Renaming the fields. This step was necessary since we aim to use the exact names for plotting and they should be representative and without any additional characters such as "_".

This is completely useless. Nu ma incanta cu ABSOLUT nimic, mai tare ma incurca to be honest fullsetDF = fullsetDF.rename(columns={ “date”:“DATE”, “state”:“STATE”, “n_killed”:“NO PEOPLE KILLED”, “n_injured”:“NO PEOPLE INJURED”, “gun_stolen”:“STOLEN GUN”, })

De terminat aici

#city_or_county address n_killed n_injured incident_url source_url incident_url_fields_missing congressional_district gun_stolen gun_type incident_characteristics latitude location_description longitude n_guns_involved notes participant_age participant_age_group participant_gender participant_name participant_relationship participant_status participant_type sources state_house_district state_senate_district

The clean dataframe was saved in a new csv file.

fullsetDF.to_csv("gun-violence_processed-data.csv")

Printing a snapshot of the final dataframe to showcase the data we are now working with.

Definition of variables

This section contains all the variables that will be used going further for plotting. Any data selection or processing will be done here.

Calculating total numbers of * incidents that occured * cases that resulted in killings * cases that resulted in injuries in each state, regardless of month or year.

totalPerState = []
totalKillingsPerState = []
totalInjuriesPerState = []

for i in range (0, len(stateCodesDF)):
    state = stateCodesDF["State"][i]

    dfHelper = fullsetDF[fullsetDF["state"]==state]['state'].copy(deep=True)
    totalPerState.append(dfHelper.count().item())

    dfHelper = fullsetDF[(fullsetDF["state"] == state) & (fullsetDF['n_killed'] > 0)]['state'].copy(deep=True)
    totalKillingsPerState.append(dfHelper.count().item())

    dfHelper = fullsetDF[(fullsetDF["state"] == state) & (fullsetDF['n_injured'] > 0)]['state'].copy(deep=True)
    totalInjuriesPerState.append(dfHelper.count().item())

stateCodes = stateCodesDF["Code"]

Graphs

Initial exploration of the dataset

Chart Type: Barplot with one numeric variable

We have defined a function for this type of plot and called it for each of the scenarios.

initial_eda_barplots <- function(all_labels, all_values, x_name, y_name, plot_title, colour){
  
  data=data.frame(name = all_labels, value = all_values)

  totalIncidentsBarPlot <- ggplot(data, aes(x=name, y = value)) + geom_bar(stat = "identity", width=0.7, fill = colour) + 
    xlab(x_name) + 
    ylab(y_name) +
    ggtitle(plot_title)

  totalIncidentsBarPlot +
    theme(
      plot.title = element_text(hjust = 0.5),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "white"), 
      axis.text.x = element_text(angle = 90, vjust = 0.5)
  )
}

Variable: Total number of incidents that occured

We want to see the distribution of the total number of incidents per each state. We will use the totalPerState dataframe and the codes from the stateCodes array.

r_totalPerState <- py$totalPerState
r_labels <- py$stateCodes

initial_eda_barplots(r_labels, r_totalPerState, "State Code", "Number of incidents", "Distribution of incidents per State", "#091CC1")

The general level of incidents can roughly be found somewhere in the interval [0,5000], but we do see some outliers - states have a significantly higher number of gun-involved cases.

Variable: Total number of cases that resulted in killings

r_totalKillingsPerState <- py$totalKillingsPerState

initial_eda_barplots(r_labels, r_totalKillingsPerState, "State Code", "Number of killings", "Distribution of cases that resulted in killings per State", "#652194")

Variable: Total number of cases that resulted in injuries

r_totalInjuriesPerState <- py$totalInjuriesPerState

initial_eda_barplots(r_labels, r_totalInjuriesPerState, "State Code", "Number of injuries", "Distribution of cases that resulted in injuries per State", "#134611")

Outlier values can be noticed in the last two charts as well. Given this observation, it would make sense to realize another chart - a Boxplot - that would specifically highlight this information. Even though boxplots can sometimes be missleading because of the loss of information, our purpose for this scenario is to identify those states that do not fall into the IQR (Interquantile Range), purpose which makes this type of chart appropiate.

Chart Type: Boxplot with three series

Variables: Total number of incidents, firearm deaths and injuries, respectively

r_totalPerState <- py$totalPerState
r_totalKillingsPerState <- py$totalKillingsPerState
r_totalInjuriesPerState <- py$totalInjuriesPerState

p <- boxplot(r_totalPerState, r_totalKillingsPerState, r_totalInjuriesPerState,
main = "Outlier variables - States that have higher criminality",
names = c("Number of incidents", "Number of killings", "Number of injuries"),
col = c("#091CC1","#652194", "#134611")
)

p
## $stats
##         [,1]   [,2]   [,3]
## [1,]   289.0   47.0   47.0
## [2,]  1610.0  234.0  515.0
## [3,]  3201.0  704.0 1120.0
## [4,]  6383.5 1555.5 2770.5
## [5,] 10244.0 3354.0 5870.0
## attr(,"class")
## Number of incidents 
##           "integer" 
## 
## $n
## [1] 51 51 51
## 
## $conf
##          [,1]     [,2]      [,3]
## [1,] 2144.891 411.6257  620.9836
## [2,] 4257.109 996.3743 1619.0164
## 
## $out
## [1] 16306 15029 17556 13577  4972  4350 11023
## 
## $group
## [1] 1 1 1 1 2 2 3
## 
## $names
## [1] "Number of incidents" "Number of killings"  "Number of injuries"

Chart Type: Doughnut Plot

We have decided to further investigate the outlier states so as to better understand the factors that led to an increased number of incidents involving firearms.

The first type of analysis targets the age categories. For this purpose, the data was visualized in separate doughnut charts as seen below.

Calculating the percentage of Child, Teenager and Adult participants for each state.

outlierList = np.array(["Pennsylvania", "California", "Colorado"])
totalsStates = np.zeros(shape=(len(outlierList),3), dtype = float)
categories = np.array(["CHILDREN", "TEENAGERS", "ADULTS"])

for i in range(0,len(outlierList)):
    state = outlierList[i]
    array = []

    dfHelper = fullsetDF[fullsetDF["state"]==state][categories[0]].copy(deep=True)
    totalsStates[i][0] = dfHelper.sum().item()

    dfHelper = fullsetDF[fullsetDF["state"]==state][categories[1]].copy(deep=True)
    totalsStates[i][1] = dfHelper.sum().item()

    dfHelper = fullsetDF[fullsetDF["state"]==state][categories[2]].copy(deep=True)
    totalsStates[i][2] = dfHelper.sum().item()

# Percentages per state for each group
for i in range(0, len(outlierList)):
    total = totalsStates[i][0] + totalsStates[i][1] + totalsStates[i][2]
    for j in range(0, 3):
        totalsStates[i][j] = float("{:.4f}".format(totalsStates[i][j] / total))*100
donut_plot_labels <- function(all_labels, all_values) {
  
data = data.frame(name = all_labels, value = all_values)

data$ymax = cumsum(all_values)
data$ymin = c(0, head(data$ymax, n=-1))

data$labelPosition <- (data$ymax + data$ymin) / 2
data$label <- paste0(data$name, ": ", data$value, "%")

ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=name)) + 
  geom_rect() +
  geom_text( x=2, aes(y=labelPosition, label=label, color=name), size=3) +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2") +
  coord_polar(theta="y") +
  xlim(c(-1, 4))
}

Variables: Percentage of each age group involved in an incident per state and across the states.

r_categories <- py$categories
matrix <- py$totalsStates
for(row in 1 : nrow(matrix)){
  r_totalsStates<- matrix[row,]
  print(donut_plot_labels(r_categories, r_totalsStates))
}

We would also like to have an overview of the age groups across the outlier states to see if there are any differences between the states. We could assume that some of the states have a higher percentage of teenagers involved in gun-voilence than others.

# Percentages per group for each state
for j in range(0, 3):
    total = totalsStates[0][j] + totalsStates[1][j] + totalsStates[2][j]
    for i in range(0, len(outlierList)):
        totalsStates[i][j] = float("{:.4f}".format(totalsStates[i][j] / total))*100
totalsStates = totalsStates.transpose()
r_outlierList <- py$outlierList

matrix <- py$totalsStates
for(row in 1 : nrow(matrix)){
  r_totalsStates<- matrix[row,]
  print(donut_plot_labels(r_outlierList, r_totalsStates))
}

The differences between the states are - FINISH HERE

Chart Type: Circular BarPlot with one numeric variable

The purpose of the next section is to identify the number of guns used in all the incidents per state.

gunsTotalPerState = []
for i in range (0, len(stateCodesDF)):
    state = stateCodesDF["State"][i]
    dfHelper = fullsetDF[fullsetDF["state"]==state]["n_guns_involved"].copy(deep=True)
    gunsTotalPerState.append(dfHelper.sum().item())
gunsTotalPerState = np.array(gunsTotalPerState)
r_gunsTotalPerState <- py$gunsTotalPerState

Chart Type: Heatmap

The 2 categorical variables that are taken into account for this type of chart are the IncidentType and the States. We have previously split the incidet type into 7 different categories. We would now like to make an analysis of the causes of gun-violence, i.e. how often each type occurs for the individual states.

typeTotalPerState = np.empty(shape=(len(incidentType)*len(stateCodesDF)))
it = 0
for i in range(0, len(incidentType)):
    incident = incidentType[i]

    for j in range (0, len(stateCodesDF)):
        state = stateCodesDF["State"][j]
        dfHelper = fullsetDF[(fullsetDF["state"]==state) & (fullsetDF[incident] == "YES")][incident].copy(deep=True)

        typeTotalPerState[it] = dfHelper.count()
        it = it + 1
r_typeTotalPerState <- py$typeTotalPerState
x <- py$stateCodes
y <- py$incidentType
data <- expand.grid(X=x, Y=y)

data$Z <- r_typeTotalPerState

ggplot(data, aes(X, Y, fill=Z)) + 
 geom_tile() + 
  scale_fill_distiller(palette = "Greens") + 
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5)
  )

Chart Type: Scatter Plot

Chart Type: Histogram with two numeric variables

We want to see the distribution of people involved in an incident, split by gender. We will use two Y axis and …dataframes.

Chart Type: Doughnut Chart

We want

Variables: Total number of incidents, total number of incidents per state => procent

Calculating total number of incidents per state and adding them into a dictionary. The states with incident percentages less than 1% will be categorized as others.

totalPerState = []
incidentsStatesArray = np.array(fullsetDF['state'].to_numpy())
for i in range(0, len(stateCodesDF)):
    state = stateCodesDF["State"][i]
    totalPerState.append(incidentsStatesArray[incidentsStatesArray == state].shape[0])
        
DictStateIncident = {"Others": 0}
dfHelper = fullsetDF[fullsetDF["state"]==state]['state'].copy(deep=True)
totalIncidents = len(dfHelper)

incidentsPerStatePercentages = (np.array(totalPerState) / totalIncidents)
for i in range(0,len(stateCodes)):
    if incidentsPerStatePercentages[i] > 1:
        DictStateIncident[stateCodes[i]] = round(incidentsPerStatePercentages[i], 2)
    else:
        DictStateIncident["Others"] = DictStateIncident["Others"] + round(incidentsPerStatePercentages[i], 2)
        
statesForDonut = np.array(list(DictStateIncident.keys()))
valuesForDonut = np.array(list(DictStateIncident.values()))
donut_plot <- function(all_labels, all_values) {
  
data = data.frame(name = all_labels, value = all_values)

data$ymax = cumsum(all_values)
data$ymin = c(0, head(data$ymax, n=-1))

ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=name)) + 
  geom_rect() +
  coord_polar(theta="y") +
  xlim(c(2, 4))
}
r_labels <- py$statesForDonut
r_percentagesPerState <- py$valuesForDonut

donut_plot(r_labels, r_percentagesPerState)

Calculating the number of incidents per region: NE, NW, SW, SE.

We consider the center position the geographic center of the United States. This is a point approximately 32 km north of Belle Fourche, South Dakota at LAT. 39°50’ LONG. −98°35’

DictIncidentsPerRegion = {"NE": 0, "NW": 0, "SW": 0, "SE": 0}

incidentsCoordinates = fullsetDF[["latitude", "longitude"]]

for i in range(0, len(incidentsCoordinates)):
    if incidentsCoordinates["latitude"][i] > 39 and incidentsCoordinates["longitude"][i] > -98:
        DictIncidentsPerRegion["NE"] += 1
    if incidentsCoordinates["latitude"][i] > 39 and incidentsCoordinates["longitude"][i] < -98:
        DictIncidentsPerRegion["NW"] += 1
    if incidentsCoordinates["latitude"][i] < 39 and incidentsCoordinates["longitude"][i] < -98:
        DictIncidentsPerRegion["SW"] += 1
    if incidentsCoordinates["latitude"][i] < 39 and incidentsCoordinates["longitude"][i] > -98:
        DictIncidentsPerRegion["SE"] += 1

totalNoIncidents = sum(DictIncidentsPerRegion.values())
  
regions = np.array(list(DictIncidentsPerRegion.keys()))
incidentsPerRegions = np.round(np.array((np.array(list(DictIncidentsPerRegion.values())) / totalNoIncidents) * 100), 2)
r_regions <- py$regions
r_incidentsPerRegion <- py$incidentsPerRegions

donut_plot_labels(r_regions, r_incidentsPerRegion)

Chart Type: Lollipop chart with one numeric variable

In this section we show you the top 25 and the buttom 25 news websites that report crimes. In this way we could decide which online newspaper is the most and the least trustful.

sourcesPerIncident = fullsetDF[["date", "sources"]]

DictSourcesAndIncidents = {"pittsburgh.cbslocal.com" : 0}
for i in range (0, len(sourcesPerIncident["sources"])):
    try:
        sourcesArray = sourcesPerIncident["sources"][i].split("||")
    except:
        continue
    for source in sourcesArray:
        URI = source.strip().partition("//")[2]
        URL = URI.partition("/")[0]
        if URL in DictSourcesAndIncidents:
            DictSourcesAndIncidents[URL] += 1
        else:
            DictSourcesAndIncidents[URL] = 0

lessIncidentsPerSource = dict(filter(lambda elem: elem[1] != 0, DictSourcesAndIncidents.items()))

sort_by_value = lambda DictSourcesAndIncidents: DictSourcesAndIncidents[1]
sortedSourcesDict = sorted(DictSourcesAndIncidents.items(), key = sort_by_value, reverse=True)

sourcesName = []
sourceNews = []
for x in list(sortedSourcesDict)[0:25]:
    sourcesName.append(x[0])
    sourceNews.append(x[1])
    
sortedSourcesLeastTrustful = sorted(lessIncidentsPerSource.items(), key = sort_by_value)

sourcesLeastTrustful = []
appearancesLeastTrusful = []
for x in list(sortedSourcesLeastTrustful)[0:25]:
    sourcesLeastTrustful.append(x[0])
    appearancesLeastTrusful.append(x[1])
library(plotly)

lollipop_function <- function(sources, appearances, title) {
    
    data <- data.frame(x = sources, y = appearances)
    
    p <- ggplot(data, aes(x=x, y=y)) +
      geom_segment( aes(x=reorder(x, y), xend=x, y=0, yend=y), color="grey") +
      geom_point( color="orange", size=4, fill=alpha("orange", 0.3), alpha=0.7, shape=21, stroke=2) +
      theme_light() +
      coord_flip() +
      ggtitle(title) +
      scale_size(range=c(2,12), guide="none") +
      theme(
        panel.grid.major.x = element_blank(),
        panel.border = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(size = 12)
      ) +
      xlab("") +
      ylab("Number of incidents reported along the whole period")
    
    ggplotly(p)
}
r_source <- py$sourcesName
r_news <- py$sourceNews

lollipop_function(r_source, r_news, "The most trustful online newspeper that report gun-violence incidents")
r_source <- py$sourcesLeastTrustful
r_news <- py$appearancesLeastTrusful

lollipop_function(r_source, r_news, "The least trustful online newspeper that report gun-violence incidents")

Chart Type: We want to see the level of crime for each state proportional to the size of the population. We have previously calcuated the respective percentages and they will be ploted below, also adding tooltips to offer more details for each state.

Overall Conclusions